It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
The file should contain a header and have the following format:
You can download an example submission file (sample_submission.csv) on the Data page.
The dataset comme from Kaggle; informations can be found on the following link :
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
Names North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gabrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Contrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches
NA No Basement
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Mimimum Exposure
No No Exposure
NA No Basement
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Bedrooms above grade (does NOT include basement bedrooms)
Kitchen: Kitchens above grade
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
GarageCond: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
SaleCondition: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
# Importing working libraries
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
#importing classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, Normalizer, StandardScaler
from sklearn.metrics import accuracy_score, auc, roc_curve, roc_auc_score, mean_squared_error, f1_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict, cross_validate
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from typing import List, Tuple
import warnings
warnings.filterwarnings("ignore")
#make charts show even when the notebook is exported
%matplotlib inline
def drop_features(features_to_drop:List[str], df:pd.DataFrame, drop_in_place:bool=True)->pd.DataFrame:
""" Drop features in a dataframe """
return df.drop(features_to_drop,axis=1,inplace=drop_in_place)
def encode_feature_values(feature_to_encode:str,df:pd.DataFrame, mapping:dict, encode_in_place:bool=True)->pd.DataFrame:
"""
mapping is a dict looking like :
mapping = {
'Jan':1,
'Feb':2,
'Mar':3,
'Apr':4,
'May':5,
'Jun':6,
'Jul':7,
'Aug':8,
'Sep':9,
'Oct':10,
'Nov':11,
'Dec':12,
}
"""
if feature_to_encode in df.columns:
return df[feature_to_encode].replace(mapping,inplace=encode_in_place)
else:
raise Exception("Feature {} doesn't exist in the dataframe".format(feature_to_encode))
# def plot_distribution_in_features(df:pd.DataFrame, fig_height:int=180, fig_width:int=20,
# subplot_ncols:int=2, subplot_nrows:int=None):
# """
# Plot distribution of each numerical feature in the given dataframe.
# """
# categorical_columns=[]
# numerical_columns=[]
# # Determine categorical features and numerical ones
# for i in range(len(df.columns)):
# column_name = df.columns[i]
# if df[column_name].dtype == "O" : # Column type is Categorical
# categorical_columns.append(column_name)
# else: # Column type is Numerical
# numerical_columns.append(column_name)
# # Compute the number of rows, which is equal to the number of numercial features
# if subplot_nrows is None:
# subplot_nrows=len(numerical_columns)
# plt.figure(figsize=(fig_width,fig_height))
# previous_used_line=0
# current_line=-1
# # Plot for each categorical feature
# for numerical_column in numerical_columns:
# plot_label_size = 15
# current_line+=1
# # Plot boxplot
# plt.subplot(subplot_nrows,subplot_ncols,current_line+previous_used_line+1)
# sb.boxplot(x=df.loc[:,numerical_column])
# plt.xlabel(numerical_column, size=plot_label_size)
# # Plot density plot
# plt.subplot(subplot_nrows,subplot_ncols,current_line+previous_used_line+2)
# df[numerical_column].plot(kind="density")
# plt.xlabel(numerical_column, size=plot_label_size)
# previous_used_line+=1
def plot_distribution_in_features(df:pd.DataFrame, fig_height:int=180, fig_width:int=20,
subplot_ncols:int=2, subplot_nrows:int=None, features_to_ignore:List[str]=[]):
"""
Plot distribution of each numerical feature in the given dataframe.
"""
categorical_columns=[]
numerical_columns=[]
#if len(features_to_ignore)>0:
# df.drop(features_to_ignore, axis=1,inplace=True)
# Determine categorical features and numerical ones
for i in range(len(df.columns)):
column_name = df.columns[i]
if df[column_name].dtype == "O" : # Column type is Categorical
categorical_columns.append(column_name)
else: # Column type is Numerical
numerical_columns.append(column_name)
# Exclude from the numerical_columns the values from features_to_ignore, in order to ignore those features
# when plotting the charts
for f in features_to_ignore:
if f in numerical_columns:
numerical_columns.remove(f) ## TODO : continue from here : remove f from numerical_columns
# Compute the number of rows, which is equal to the number of numercial features
if subplot_nrows is None:
subplot_nrows=len(numerical_columns)
plt.figure(figsize=(fig_width,fig_height))
previous_used_line=0
current_line=-1
# Plot for each categorical feature
for numerical_column in numerical_columns:
print("Processing column : {}".format(numerical_column))
plot_label_size = 15
current_line+=1
# Plot boxplot
plt.subplot(subplot_nrows,subplot_ncols,current_line+previous_used_line+1)
sb.boxplot(x=df.loc[:,numerical_column])
plt.xlabel(numerical_column, size=plot_label_size)
# Plot density plot
plt.subplot(subplot_nrows,subplot_ncols,current_line+previous_used_line+2)
df[numerical_column].plot(kind="density")
plt.xlabel(numerical_column, size=plot_label_size)
previous_used_line+=1
def plot_distribution_in_feature(df:pd.DataFrame, feature:str, fig_height:int=5, fig_width:int=20,
subplot_ncols:int=2, subplot_nrows:int=1):
"""
Plot distribution of the given feature
"""
plt.figure(figsize=(fig_width,fig_height))
column_name = feature
plot_label_size = 15
#ploting boxplot
plt.subplot(subplot_nrows,subplot_ncols,1)
sb.boxplot(x=df.loc[:,column_name])
plt.xlabel(column_name, size=plot_label_size)
#ploting density plot
plt.subplot(subplot_nrows,subplot_ncols,2)
df[column_name].plot(kind="density")
plt.xlabel(column_name, size=plot_label_size)
#plot_distribution_in_features(df=df,fig_height=180)
#df.LotArea.dtype
#df.LotArea.dtype == "int64"
#df.ExterQual.dtype
#df.ExterQual.dtype == "O"
#df.ExterQual.dtype == "int64"
#type(df.ExterQual)
#df.loc[:,"ExterQual"]
def get_outlier_indexes(df:pd.DataFrame, feature:str, data_points_scale:float=1.5,
lower_bound_quantile =.25, upper_bound_quantile=.75):
"""
Return outliers indexes in the given feature (column).
@params :
- df : the dataframe to analyse
- feature : the feature in which to search for outliers
- data_points_scale : data points scale to use in order to determine good values ranges
- lower_bound_quantile : quantile to use to determine good values' lower bound
- upper_bound_quantile : quantile to use to determine good values' upper bound
This function use the interquantile method in order to determine outliers.
"""
# Lower bound quantile
IQ1 = df[feature].quantile(lower_bound_quantile) #df[feature].quantile(0.25)
# Upper bound quantile
IQ3 = df[feature].quantile(upper_bound_quantile) #df[feature].quantile(0.75)
# Median (middle) value between lower and upper bound quantiles
IQR = IQ3-IQ1
# Calculate lower and upper bounds value
lower_bound = IQ1 - data_points_scale*IQR
upper_bound = IQ3 + data_points_scale*IQR
# Retrieve outliers indexes in the dataframe's feature
outlier_indexes = df.index[ (df[feature]<lower_bound) | (df[feature]>upper_bound) ]
return outlier_indexes
# original method
# def remove_outliers_in_features(df:pd.DataFrame, data_points_scale:float=1.5, drop_in_place:bool=True)->pd.DataFrame:
# """
# Remove outliers in all features of the given dataframe
# """
# # Get the outliers indexes in each feature
# outliers_indexes = []
# for feature in df.columns:
# outliers_indexes.extend( get_outlier_indexes(df,feature) )
# # Remove duplicate indexes
# outliers_indexes = set(outliers_indexes)
# print("Total number of outliers in data : {}".format(len(outliers_indexes)))
# # Drop samples wich have outlier in one of their features
# df = df.drop(outliers_indexes,inplace=drop_in_place, axis=0)
# print("df.shape after removing outliers : {}".format(df.shape))
# return df
def remove_outliers_in_features(df:pd.DataFrame, data_points_scale:float=1.5, drop_in_place:bool=True,
features_to_ignore:List[str]=[])->pd.DataFrame:
"""
Remove outliers in all features of the given dataframe
"""
print("df.shape before outliers removal : {}".format(df.shape))
categorical_columns=[]
numerical_columns=[]
#print("features_to_ignore")
#print(features_to_ignore)
if len(features_to_ignore)>0:
df.drop(features_to_ignore, axis=1,inplace=True)
# Determine categorical features and numerical ones
for i in range(len(df.columns)):
column_name = df.columns[i]
if df[column_name].dtype == "O" : # Column type is Categorical
categorical_columns.append(column_name)
else: # Column type is Numerical
numerical_columns.append(column_name)
# Get the outliers indexes in each numerical feature
outliers_indexes = []
for numerical_column in numerical_columns: #for feature in df.columns:
outliers_indexes.extend( get_outlier_indexes(df,feature=numerical_column) )
# Remove duplicate indexes
outliers_indexes = set(outliers_indexes)
print("Total number of outliers in data : {}".format(len(outliers_indexes)))
# Drop samples wich have outlier in one of their features
if drop_in_place:
df.drop(outliers_indexes,inplace=drop_in_place, axis=0)
else:
df = df.drop(outliers_indexes,inplace=drop_in_place, axis=0)
print("df.shape after outliers removal : {}".format(df.shape))
return df
#df.columns
#df = remove_outliers_in_features(df=df, features_to_ignore=features_to_ignore_when_dealing_with_outliers)
def remove_outliers_in_feature(df:pd.DataFrame, feature:str, data_points_scale:float=1.5,
drop_in_place:bool=True)->pd.DataFrame:
"""
Remove outliers in the wanted feature of the given dataframe
"""
if df[feature].dtype == "O" :
raise Exception("The feature {} is categorical, waiting for a numerical feature".format(feature))
print("df.shape before outliers removal : {}".format(df.shape))
# Retrieve outliers index in the given feature
outliers_indexes = []
outliers_indexes.extend( get_outlier_indexes(df,feature) )
outliers_indexes = set(outliers_indexes)
print("Total number of outliers in data : {}".format(len(outliers_indexes)))
#df = df.drop(outliers_indexes,inplace=drop_in_place, axis=0)
if drop_in_place:
df.drop(outliers_indexes,inplace=drop_in_place, axis=0)
else:
df = df.drop(outliers_indexes,inplace=drop_in_place, axis=0)
print("df.shape after outliers removal : {}".format(df.shape))
return df
def remove_upper_bound_outliers_in_feature(df:pd.DataFrame, feature:str, upper_bound:float)->pd.DataFrame:
"""
Remove outliers on the upper bound side in the wanted feature in the given dataframe
@return : new dataframe without outlier in the wanted feature
@params :
df : given dataframe
feature : feature in which to remove outlier
upper_bound : upper bound value where any right side value is considered outlier
"""
new_df = df.loc[df[column]<upper_bound]
remove_outliers_in_feature(new_df,feature=column)
plot_distribution_in_feature(new_df, feature=column)
return new_df
def remove_lower_bound_outliers_in_feature(df:pd.DataFrame, feature:str, lower_bound:float)->pd.DataFrame:
"""
Remove outliers on the upper bound side in the wanted feature in the given dataframe
@return : new dataframe without outlier in the wanted feature
@params :
df : given dataframe
feature : feature in which to remove outlier
upper_bound : lower bound value where any left side value is considered outlier
"""
new_df = df.loc[ df[column]>lower_bound ]
remove_outliers_in_feature(new_df,feature=column)
plot_distribution_in_feature(new_df, feature=column)
return new_df
def drop_features_with_one_occurence(df:pd.DataFrame, features_to_ignore:List[str]=["SalePrice"])->pd.DataFrame:
""" Drop features having only one occurence """
print(f"Shape before processing: {df.shape}")
for column in df.columns:
if column not in features_to_ignore:
if df[column].nunique()==1:
print(f"Dropping feature {column}")
df.drop([column],axis=1,inplace=True)
print(f"Shape after processing: {df.shape}")
return df
folder = "./house-prices-advanced-regression-techniques/"
train_file = "train.csv"
test_file = "test.csv"
# Show all column when calling df.head()
pd.set_option("display.max_columns",None)
df = pd.read_csv(folder+train_file)
df.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
df.shape
(1460, 81)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 1452 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
Quick remarks about the informations
# I will start by dropping features that have to little values to be used
useless_features = ["Alley","PoolQC","Fence","MiscFeature"]
drop_features(features_to_drop=useless_features,df=df,drop_in_place=True)
useless_features = []
df.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | WD | Normal | 250000 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 77 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 LotShape 1460 non-null object 7 LandContour 1460 non-null object 8 Utilities 1460 non-null object 9 LotConfig 1460 non-null object 10 LandSlope 1460 non-null object 11 Neighborhood 1460 non-null object 12 Condition1 1460 non-null object 13 Condition2 1460 non-null object 14 BldgType 1460 non-null object 15 HouseStyle 1460 non-null object 16 OverallQual 1460 non-null int64 17 OverallCond 1460 non-null int64 18 YearBuilt 1460 non-null int64 19 YearRemodAdd 1460 non-null int64 20 RoofStyle 1460 non-null object 21 RoofMatl 1460 non-null object 22 Exterior1st 1460 non-null object 23 Exterior2nd 1460 non-null object 24 MasVnrType 1452 non-null object 25 MasVnrArea 1452 non-null float64 26 ExterQual 1460 non-null object 27 ExterCond 1460 non-null object 28 Foundation 1460 non-null object 29 BsmtQual 1423 non-null object 30 BsmtCond 1423 non-null object 31 BsmtExposure 1422 non-null object 32 BsmtFinType1 1423 non-null object 33 BsmtFinSF1 1460 non-null int64 34 BsmtFinType2 1422 non-null object 35 BsmtFinSF2 1460 non-null int64 36 BsmtUnfSF 1460 non-null int64 37 TotalBsmtSF 1460 non-null int64 38 Heating 1460 non-null object 39 HeatingQC 1460 non-null object 40 CentralAir 1460 non-null object 41 Electrical 1459 non-null object 42 1stFlrSF 1460 non-null int64 43 2ndFlrSF 1460 non-null int64 44 LowQualFinSF 1460 non-null int64 45 GrLivArea 1460 non-null int64 46 BsmtFullBath 1460 non-null int64 47 BsmtHalfBath 1460 non-null int64 48 FullBath 1460 non-null int64 49 HalfBath 1460 non-null int64 50 BedroomAbvGr 1460 non-null int64 51 KitchenAbvGr 1460 non-null int64 52 KitchenQual 1460 non-null object 53 TotRmsAbvGrd 1460 non-null int64 54 Functional 1460 non-null object 55 Fireplaces 1460 non-null int64 56 FireplaceQu 770 non-null object 57 GarageType 1379 non-null object 58 GarageYrBlt 1379 non-null float64 59 GarageFinish 1379 non-null object 60 GarageCars 1460 non-null int64 61 GarageArea 1460 non-null int64 62 GarageQual 1379 non-null object 63 GarageCond 1379 non-null object 64 PavedDrive 1460 non-null object 65 WoodDeckSF 1460 non-null int64 66 OpenPorchSF 1460 non-null int64 67 EnclosedPorch 1460 non-null int64 68 3SsnPorch 1460 non-null int64 69 ScreenPorch 1460 non-null int64 70 PoolArea 1460 non-null int64 71 MiscVal 1460 non-null int64 72 MoSold 1460 non-null int64 73 YrSold 1460 non-null int64 74 SaleType 1460 non-null object 75 SaleCondition 1460 non-null object 76 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(39) memory usage: 878.4+ KB
# Number of N/A values
df.isna().sum()
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 77, dtype: int64
# Number of Null values
df.isnull().sum()
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 77, dtype: int64
Quick remarks about isna().sum() and isnull().sum()
Although there are many columns which results cannot be seen, it can be observed that :
Half of the values in FireplaceQu are null. Looking at it description (refer to Kaggle), Fireplace quality, it can be assumed that it may also play an important role in the price of the house. So I will replace its null values by a code (e.g. unknow).
Some values in LotFrontage are null. But I don't wish to drop their row, as I don't have many samples to begin with. So I will replaces its null values by 0 since it a numerical feature.
As for the other features which have null value, I will directly drop their samples as their null values are not so many.
Not dropping the sample with null values, but encoding the said null values, will not necessarely have a negative impact on my model later, as I will later use feature selection in order to select the best features for making a good prediction.
Dealing with null value in FireplaceQu
set(df["FireplaceQu"].values)
{'Ex', 'Fa', 'Gd', 'Po', 'TA', nan}
df["FireplaceQu"]
0 NaN
1 TA
2 TA
3 Gd
4 TA
...
1455 TA
1456 TA
1457 Gd
1458 NaN
1459 NaN
Name: FireplaceQu, Length: 1460, dtype: object
df["FireplaceQu"] = df["FireplaceQu"].fillna("Unknow_FireplaceQu")
set(df["FireplaceQu"].values)
{'Ex', 'Fa', 'Gd', 'Po', 'TA', 'Unknow_FireplaceQu'}
df["FireplaceQu"]
0 Unknow_FireplaceQu
1 TA
2 TA
3 Gd
4 TA
...
1455 TA
1456 TA
1457 Gd
1458 Unknow_FireplaceQu
1459 Unknow_FireplaceQu
Name: FireplaceQu, Length: 1460, dtype: object
Dealing with null value in LotFrontage
set(df["LotFrontage"].values)
{nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
21.0,
24.0,
30.0,
nan,
nan,
nan,
nan,
32.0,
33.0,
nan,
34.0,
nan,
nan,
35.0,
37.0,
38.0,
40.0,
41.0,
43.0,
44.0,
47.0,
48.0,
49.0,
50.0,
51.0,
52.0,
53.0,
54.0,
55.0,
56.0,
57.0,
58.0,
59.0,
60.0,
61.0,
62.0,
63.0,
64.0,
65.0,
66.0,
67.0,
68.0,
69.0,
70.0,
71.0,
72.0,
73.0,
74.0,
75.0,
76.0,
77.0,
78.0,
79.0,
80.0,
81.0,
82.0,
83.0,
84.0,
85.0,
86.0,
87.0,
88.0,
89.0,
90.0,
91.0,
92.0,
93.0,
94.0,
95.0,
96.0,
97.0,
98.0,
99.0,
100.0,
101.0,
102.0,
103.0,
104.0,
105.0,
106.0,
107.0,
108.0,
109.0,
110.0,
111.0,
112.0,
114.0,
115.0,
116.0,
118.0,
120.0,
121.0,
122.0,
nan,
nan,
nan,
nan,
nan,
128.0,
129.0,
130.0,
nan,
124.0,
134.0,
137.0,
138.0,
140.0,
141.0,
144.0,
149.0,
150.0,
152.0,
153.0,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
36.0,
160.0,
168.0,
174.0,
182.0,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
39.0,
nan,
nan,
42.0,
45.0,
46.0,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
313.0,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan}
df["LotFrontage"] = df["LotFrontage"].fillna(0)
set(df["LotFrontage"].values)
{0.0,
21.0,
24.0,
30.0,
32.0,
33.0,
34.0,
35.0,
36.0,
37.0,
38.0,
39.0,
40.0,
41.0,
42.0,
43.0,
44.0,
45.0,
46.0,
47.0,
48.0,
49.0,
50.0,
51.0,
52.0,
53.0,
54.0,
55.0,
56.0,
57.0,
58.0,
59.0,
60.0,
61.0,
62.0,
63.0,
64.0,
65.0,
66.0,
67.0,
68.0,
69.0,
70.0,
71.0,
72.0,
73.0,
74.0,
75.0,
76.0,
77.0,
78.0,
79.0,
80.0,
81.0,
82.0,
83.0,
84.0,
85.0,
86.0,
87.0,
88.0,
89.0,
90.0,
91.0,
92.0,
93.0,
94.0,
95.0,
96.0,
97.0,
98.0,
99.0,
100.0,
101.0,
102.0,
103.0,
104.0,
105.0,
106.0,
107.0,
108.0,
109.0,
110.0,
111.0,
112.0,
114.0,
115.0,
116.0,
118.0,
120.0,
121.0,
122.0,
124.0,
128.0,
129.0,
130.0,
134.0,
137.0,
138.0,
140.0,
141.0,
144.0,
149.0,
150.0,
152.0,
153.0,
160.0,
168.0,
174.0,
182.0,
313.0}
Dealing with Null and Duplicate values
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 77 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1460 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 LotShape 1460 non-null object 7 LandContour 1460 non-null object 8 Utilities 1460 non-null object 9 LotConfig 1460 non-null object 10 LandSlope 1460 non-null object 11 Neighborhood 1460 non-null object 12 Condition1 1460 non-null object 13 Condition2 1460 non-null object 14 BldgType 1460 non-null object 15 HouseStyle 1460 non-null object 16 OverallQual 1460 non-null int64 17 OverallCond 1460 non-null int64 18 YearBuilt 1460 non-null int64 19 YearRemodAdd 1460 non-null int64 20 RoofStyle 1460 non-null object 21 RoofMatl 1460 non-null object 22 Exterior1st 1460 non-null object 23 Exterior2nd 1460 non-null object 24 MasVnrType 1452 non-null object 25 MasVnrArea 1452 non-null float64 26 ExterQual 1460 non-null object 27 ExterCond 1460 non-null object 28 Foundation 1460 non-null object 29 BsmtQual 1423 non-null object 30 BsmtCond 1423 non-null object 31 BsmtExposure 1422 non-null object 32 BsmtFinType1 1423 non-null object 33 BsmtFinSF1 1460 non-null int64 34 BsmtFinType2 1422 non-null object 35 BsmtFinSF2 1460 non-null int64 36 BsmtUnfSF 1460 non-null int64 37 TotalBsmtSF 1460 non-null int64 38 Heating 1460 non-null object 39 HeatingQC 1460 non-null object 40 CentralAir 1460 non-null object 41 Electrical 1459 non-null object 42 1stFlrSF 1460 non-null int64 43 2ndFlrSF 1460 non-null int64 44 LowQualFinSF 1460 non-null int64 45 GrLivArea 1460 non-null int64 46 BsmtFullBath 1460 non-null int64 47 BsmtHalfBath 1460 non-null int64 48 FullBath 1460 non-null int64 49 HalfBath 1460 non-null int64 50 BedroomAbvGr 1460 non-null int64 51 KitchenAbvGr 1460 non-null int64 52 KitchenQual 1460 non-null object 53 TotRmsAbvGrd 1460 non-null int64 54 Functional 1460 non-null object 55 Fireplaces 1460 non-null int64 56 FireplaceQu 1460 non-null object 57 GarageType 1379 non-null object 58 GarageYrBlt 1379 non-null float64 59 GarageFinish 1379 non-null object 60 GarageCars 1460 non-null int64 61 GarageArea 1460 non-null int64 62 GarageQual 1379 non-null object 63 GarageCond 1379 non-null object 64 PavedDrive 1460 non-null object 65 WoodDeckSF 1460 non-null int64 66 OpenPorchSF 1460 non-null int64 67 EnclosedPorch 1460 non-null int64 68 3SsnPorch 1460 non-null int64 69 ScreenPorch 1460 non-null int64 70 PoolArea 1460 non-null int64 71 MiscVal 1460 non-null int64 72 MoSold 1460 non-null int64 73 YrSold 1460 non-null int64 74 SaleType 1460 non-null object 75 SaleCondition 1460 non-null object 76 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(39) memory usage: 878.4+ KB
# Drop NA
df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1338 entries, 0 to 1459 Data columns (total 77 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1338 non-null int64 1 MSSubClass 1338 non-null int64 2 MSZoning 1338 non-null object 3 LotFrontage 1338 non-null float64 4 LotArea 1338 non-null int64 5 Street 1338 non-null object 6 LotShape 1338 non-null object 7 LandContour 1338 non-null object 8 Utilities 1338 non-null object 9 LotConfig 1338 non-null object 10 LandSlope 1338 non-null object 11 Neighborhood 1338 non-null object 12 Condition1 1338 non-null object 13 Condition2 1338 non-null object 14 BldgType 1338 non-null object 15 HouseStyle 1338 non-null object 16 OverallQual 1338 non-null int64 17 OverallCond 1338 non-null int64 18 YearBuilt 1338 non-null int64 19 YearRemodAdd 1338 non-null int64 20 RoofStyle 1338 non-null object 21 RoofMatl 1338 non-null object 22 Exterior1st 1338 non-null object 23 Exterior2nd 1338 non-null object 24 MasVnrType 1338 non-null object 25 MasVnrArea 1338 non-null float64 26 ExterQual 1338 non-null object 27 ExterCond 1338 non-null object 28 Foundation 1338 non-null object 29 BsmtQual 1338 non-null object 30 BsmtCond 1338 non-null object 31 BsmtExposure 1338 non-null object 32 BsmtFinType1 1338 non-null object 33 BsmtFinSF1 1338 non-null int64 34 BsmtFinType2 1338 non-null object 35 BsmtFinSF2 1338 non-null int64 36 BsmtUnfSF 1338 non-null int64 37 TotalBsmtSF 1338 non-null int64 38 Heating 1338 non-null object 39 HeatingQC 1338 non-null object 40 CentralAir 1338 non-null object 41 Electrical 1338 non-null object 42 1stFlrSF 1338 non-null int64 43 2ndFlrSF 1338 non-null int64 44 LowQualFinSF 1338 non-null int64 45 GrLivArea 1338 non-null int64 46 BsmtFullBath 1338 non-null int64 47 BsmtHalfBath 1338 non-null int64 48 FullBath 1338 non-null int64 49 HalfBath 1338 non-null int64 50 BedroomAbvGr 1338 non-null int64 51 KitchenAbvGr 1338 non-null int64 52 KitchenQual 1338 non-null object 53 TotRmsAbvGrd 1338 non-null int64 54 Functional 1338 non-null object 55 Fireplaces 1338 non-null int64 56 FireplaceQu 1338 non-null object 57 GarageType 1338 non-null object 58 GarageYrBlt 1338 non-null float64 59 GarageFinish 1338 non-null object 60 GarageCars 1338 non-null int64 61 GarageArea 1338 non-null int64 62 GarageQual 1338 non-null object 63 GarageCond 1338 non-null object 64 PavedDrive 1338 non-null object 65 WoodDeckSF 1338 non-null int64 66 OpenPorchSF 1338 non-null int64 67 EnclosedPorch 1338 non-null int64 68 3SsnPorch 1338 non-null int64 69 ScreenPorch 1338 non-null int64 70 PoolArea 1338 non-null int64 71 MiscVal 1338 non-null int64 72 MoSold 1338 non-null int64 73 YrSold 1338 non-null int64 74 SaleType 1338 non-null object 75 SaleCondition 1338 non-null object 76 SalePrice 1338 non-null int64 dtypes: float64(3), int64(35), object(39) memory usage: 815.3+ KB
# Drop duplicates
df.drop_duplicates()
df.shape
(1338, 77)
df.describe()
| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.00000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.00000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 731.228700 | 56.136024 | 57.855755 | 10706.294469 | 6.219731 | 5.596413 | 1973.029148 | 1985.668909 | 110.360239 | 464.234679 | 49.218236 | 582.494768 | 1095.947683 | 1176.221973 | 357.141256 | 4.221973 | 1537.585202 | 0.438714 | 0.060538 | 1.575486 | 0.402840 | 2.864723 | 1.029895 | 6.550075 | 0.648729 | 1978.596413 | 1.87145 | 501.449925 | 99.384903 | 47.782511 | 21.263827 | 3.585949 | 16.43423 | 3.010463 | 42.932735 | 6.331839 | 2007.805680 | 186761.782511 |
| std | 421.779691 | 41.252576 | 35.187787 | 10336.621126 | 1.324472 | 1.078124 | 29.563540 | 20.296463 | 185.604816 | 458.792420 | 166.196584 | 439.950528 | 405.554435 | 386.644986 | 440.324982 | 40.712916 | 520.579960 | 0.515630 | 0.241685 | 0.547570 | 0.504185 | 0.776677 | 0.174697 | 1.585071 | 0.645393 | 24.773741 | 0.63428 | 186.761863 | 127.537065 | 65.362562 | 60.843964 | 30.224622 | 58.05159 | 41.961337 | 508.056255 | 2.699437 | 1.330691 | 78913.847668 |
| min | 1.000000 | 20.000000 | 0.000000 | 1300.000000 | 2.000000 | 2.000000 | 1880.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 105.000000 | 438.000000 | 0.000000 | 0.000000 | 438.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 | 0.000000 | 1900.000000 | 1.00000 | 160.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 35311.000000 |
| 25% | 366.250000 | 20.000000 | 41.000000 | 7744.000000 | 5.000000 | 5.000000 | 1956.000000 | 1968.000000 | 0.000000 | 0.000000 | 0.000000 | 248.000000 | 819.750000 | 894.000000 | 0.000000 | 0.000000 | 1160.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 5.000000 | 0.000000 | 1962.000000 | 1.00000 | 377.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 135000.000000 |
| 50% | 730.500000 | 50.000000 | 64.000000 | 9600.000000 | 6.000000 | 5.000000 | 1976.000000 | 1994.500000 | 0.000000 | 413.000000 | 0.000000 | 489.000000 | 1021.500000 | 1098.000000 | 0.000000 | 0.000000 | 1480.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 6.000000 | 1.000000 | 1980.000000 | 2.00000 | 484.000000 | 6.000000 | 28.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 168500.000000 |
| 75% | 1098.750000 | 70.000000 | 80.000000 | 11760.750000 | 7.000000 | 6.000000 | 2001.000000 | 2004.000000 | 174.000000 | 733.000000 | 0.000000 | 815.750000 | 1324.000000 | 1414.000000 | 740.500000 | 0.000000 | 1791.500000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 7.000000 | 1.000000 | 2002.000000 | 2.00000 | 583.000000 | 174.500000 | 70.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 220000.000000 |
| max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1474.000000 | 2336.000000 | 6110.000000 | 4692.000000 | 2065.000000 | 572.000000 | 5642.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 6.000000 | 3.000000 | 12.000000 | 3.000000 | 2010.000000 | 4.00000 | 1418.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.00000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
There are too many features to make a quick analysis with .describe() only. But the charts bellow will give us more informations about each feature.
plot_distribution_in_features(df=df)
Processing column : Id Processing column : MSSubClass Processing column : LotFrontage Processing column : LotArea Processing column : OverallQual Processing column : OverallCond Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtFinSF2 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : LowQualFinSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : BsmtHalfBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : KitchenAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageYrBlt Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : EnclosedPorch Processing column : 3SsnPorch Processing column : ScreenPorch Processing column : PoolArea Processing column : MiscVal Processing column : MoSold Processing column : YrSold Processing column : SalePrice
❗Observation 1 :
Some of the variables are categorical in their nature, but they were later encoded to numercial variable, thus being also printed out when plotting the charts. Those variables 'outilier' will not be dealed with, since only continuous values can have outliers.
Those variables that will not be considered when dealing with outliers :
❗Observation 2 :
features_to_ignore_when_dealing_with_outliers = ["Id","MSSubClass","OverallQual","OverallCond","MasVnrType","GarageYrBlt",
"MoSold"]
# Replotting the outliers, to only have numerical variables plotted
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtFinSF2 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : LowQualFinSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : BsmtHalfBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : KitchenAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : EnclosedPorch Processing column : 3SsnPorch Processing column : ScreenPorch Processing column : PoolArea Processing column : MiscVal Processing column : YrSold Processing column : SalePrice
df = remove_outliers_in_features(df=df, features_to_ignore=features_to_ignore_when_dealing_with_outliers)
df.shape before outliers removal : (1338, 77) Total number of outliers in data : 723 df.shape after outliers removal : (615, 70)
❗ Observation
More than half of the sample are considered outlier. And there is not much data left to work with after outliers were removed.
The way the data was collected should be reviewed, and tuned in order to improve the quality of the samples (have less outliers).
Another way to resolve the issue is to use Data Augmentation in order to generate new but good datas based on the original good samples.
Since the data come from Kaggle, the work will be continued as such, with the 615 samples
After the removal of outliers, only one occurence remains in some features. This generate a LinAlgError: singular matrix error when trying to plot them in a box plot with plot_distribution_in_features(), which is normal since all values are, as I said, the same. So those features shall immediatly been dropped.
# Dropping feature with one occurence
df = drop_features_with_one_occurence(df)
Shape before processing: (615, 70) Dropping feature Street Dropping feature Utilities Dropping feature RoofMatl Dropping feature BsmtFinType2 Dropping feature BsmtFinSF2 Dropping feature LowQualFinSF Dropping feature BsmtHalfBath Dropping feature KitchenAbvGr Dropping feature EnclosedPorch Dropping feature 3SsnPorch Dropping feature ScreenPorch Dropping feature PoolArea Dropping feature MiscVal Shape after processing: (615, 57)
# df.head().T
# df.columns
# Checking outliers status in the samples after the 1st removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
# Dropping feature with one occurence after outliers removal
df = drop_features_with_one_occurence(df)
Shape before processing: (615, 57) Shape after processing: (615, 57)
LotFrontage : Linear feet of street connected to property
column="LotArea"
plot_distribution_in_feature(df=df, feature=column)
print(sorted(set(df[column].to_list())))
[1869, 1890, 1920, 1953, 1974, 2001, 2016, 2117, 2160, 2268, 2280, 2289, 2308, 2368, 2448, 2500, 2522, 2572, 2628, 2651, 2665, 2887, 3010, 3013, 3072, 3136, 3182, 3196, 3316, 3363, 3500, 3600, 3604, 3675, 3684, 3735, 3922, 3982, 4017, 4045, 4060, 4224, 4251, 4282, 4400, 4403, 4426, 4435, 4438, 4500, 4590, 4608, 4671, 4800, 4928, 5001, 5063, 5100, 5105, 5119, 5232, 5310, 5330, 5381, 5389, 5400, 5436, 5664, 5684, 5720, 5814, 5820, 5925, 6000, 6120, 6171, 6173, 6180, 6240, 6292, 6305, 6371, 6435, 6442, 6600, 6629, 6762, 6792, 6853, 6897, 6911, 6931, 6951, 6955, 6993, 7000, 7024, 7032, 7052, 7056, 7064, 7100, 7134, 7150, 7153, 7162, 7180, 7200, 7226, 7227, 7244, 7252, 7313, 7314, 7332, 7340, 7350, 7388, 7390, 7406, 7415, 7420, 7438, 7472, 7500, 7558, 7560, 7577, 7590, 7599, 7658, 7677, 7685, 7728, 7742, 7750, 7758, 7795, 7800, 7804, 7837, 7838, 7844, 7861, 7862, 7875, 7892, 7910, 7917, 7922, 7931, 7937, 7950, 7990, 8000, 8004, 8029, 8063, 8068, 8072, 8120, 8121, 8123, 8125, 8158, 8160, 8172, 8176, 8190, 8197, 8198, 8199, 8200, 8244, 8250, 8314, 8320, 8366, 8385, 8400, 8445, 8450, 8461, 8471, 8475, 8487, 8499, 8500, 8520, 8521, 8529, 8536, 8544, 8546, 8556, 8562, 8593, 8600, 8633, 8640, 8658, 8688, 8712, 8724, 8738, 8740, 8741, 8749, 8750, 8760, 8765, 8767, 8773, 8775, 8791, 8795, 8800, 8814, 8816, 8846, 8877, 8880, 8892, 8899, 8900, 8923, 8924, 8925, 8935, 8967, 8978, 8990, 8993, 9000, 9017, 9018, 9037, 9042, 9056, 9060, 9066, 9069, 9084, 9100, 9101, 9120, 9125, 9135, 9144, 9156, 9158, 9179, 9196, 9200, 9206, 9230, 9236, 9245, 9248, 9262, 9286, 9291, 9303, 9313, 9317, 9337, 9340, 9353, 9360, 9364, 9375, 9382, 9450, 9477, 9480, 9505, 9520, 9525, 9531, 9541, 9548, 9554, 9560, 9571, 9587, 9588, 9591, 9600, 9636, 9672, 9675, 9750, 9764, 9800, 9803, 9808, 9819, 9828, 9855, 9900, 9920, 9937, 9938, 9945, 9947, 9950, 9967, 9991, 10000, 10005, 10007, 10041, 10084, 10120, 10134, 10140, 10142, 10150, 10171, 10182, 10186, 10192, 10200, 10205, 10207, 10215, 10226, 10237, 10240, 10261, 10264, 10289, 10335, 10355, 10356, 10364, 10382, 10386, 10400, 10402, 10420, 10434, 10448, 10463, 10496, 10530, 10562, 10570, 10603, 10625, 10637, 10652, 10656, 10665, 10721, 10762, 10769, 10780, 10784, 10800, 10839, 10846, 10852, 10859, 10880, 10921, 10927, 10930, 10991, 10994, 11000, 11003, 11049, 11050, 11096, 11103, 11120, 11170, 11175, 11194, 11200, 11207, 11210, 11214, 11216, 11235, 11248, 11250, 11308, 11310, 11316, 11317, 11333, 11344, 11367, 11404, 11428, 11435, 11478, 11600, 11616, 11625, 11645, 11664, 11700, 11764, 11767, 11796, 11839, 11880, 11883, 11885, 11900, 11911, 11932, 11957, 11988, 11999, 12000, 12003, 12095, 12122, 12150, 12160, 12180, 12182, 12224, 12244, 12256, 12274, 12327, 12328, 12342, 12393, 12394, 12420, 12456, 12464, 12513, 12537, 12552, 12665, 12735, 12803, 12925, 13006, 13031, 13072, 13125, 13132, 13159, 13173, 13265, 13286, 13346, 13350, 13418, 13450, 13501, 13517, 13673, 13695, 13710, 13758, 14000, 14054, 14112, 14115, 14175, 14191, 14200, 14215, 14260, 14587, 14598, 14803, 15256, 15306, 15384, 15426, 15523, 15578, 15611, 15750, 16059, 16158, 16196, 16226, 16259, 16285, 16560, 16647, 16737, 16770, 17043, 17104, 17140, 17400, 17600]
# Removing upper bound outliers
# upper_bound = 16000
# new_df = df.loc[df[column]<upper_bound]
# remove_outliers_in_feature(new_df,feature=column)
# plot_distribution_in_feature(new_df, feature=column)
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=16000)
df.shape before outliers removal : (600, 57) Total number of outliers in data : 18 df.shape after outliers removal : (582, 57)
df = new_df
# df = new_df.copy(deep=True)
# Removing lower bound outliers
new_df = remove_lower_bound_outliers_in_feature(df=df,feature=column, lower_bound=2800)
df.shape before outliers removal : (571, 57) Total number of outliers in data : 5 df.shape after outliers removal : (566, 57)
df = new_df
# df = new_df.copy(deep=True)
MasVnrArea: Masonry veneer area in square feet
column = "MasVnrArea"
plot_distribution_in_feature(df=df, feature=column)
print(sorted(set(df[column].to_list())))
[0.0, 1.0, 11.0, 14.0, 16.0, 18.0, 24.0, 28.0, 31.0, 32.0, 36.0, 38.0, 40.0, 41.0, 42.0, 44.0, 45.0, 46.0, 48.0, 51.0, 53.0, 54.0, 57.0, 60.0, 65.0, 66.0, 68.0, 72.0, 74.0, 75.0, 76.0, 80.0, 82.0, 84.0, 85.0, 86.0, 88.0, 89.0, 90.0, 92.0, 94.0, 95.0, 98.0, 99.0, 100.0, 101.0, 104.0, 105.0, 106.0, 108.0, 110.0, 112.0, 113.0, 115.0, 116.0, 117.0, 120.0, 125.0, 128.0, 130.0, 132.0, 135.0, 136.0, 138.0, 140.0, 143.0, 145.0, 146.0, 147.0, 148.0, 149.0, 151.0, 153.0, 154.0, 156.0, 158.0, 160.0, 162.0, 163.0, 166.0, 167.0, 168.0, 169.0, 170.0, 174.0, 175.0, 176.0, 178.0, 180.0, 182.0, 183.0, 186.0, 189.0, 194.0, 196.0, 200.0, 202.0, 203.0, 204.0, 205.0, 206.0, 207.0, 208.0, 209.0, 210.0, 215.0, 218.0, 219.0, 220.0, 223.0, 224.0, 225.0, 226.0, 228.0, 230.0, 237.0, 238.0, 240.0, 243.0, 244.0, 245.0, 246.0, 248.0, 252.0, 255.0, 256.0, 258.0, 260.0, 266.0, 268.0, 270.0, 274.0, 280.0, 284.0, 285.0, 288.0, 289.0, 290.0, 292.0, 299.0, 300.0, 302.0, 306.0, 309.0, 318.0, 320.0, 328.0, 333.0, 335.0, 336.0, 338.0, 340.0, 342.0, 344.0, 348.0, 350.0, 360.0, 362.0, 366.0, 380.0, 391.0, 410.0, 412.0, 415.0, 420.0, 432.0, 435.0]
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=380)
df.shape before outliers removal : (558, 57) Total number of outliers in data : 1 df.shape after outliers removal : (557, 57)
df = new_df
# df = new_df.copy(deep=True)
# Checking outliers status in the samples after the manual removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
column="LotFrontage"
plot_distribution_in_feature(df=df, feature=column)
# print(sorted(set(df[column].to_list())))
print(sorted(set(df[ df[column] > 120 ][column].to_list())))
[122.0, 129.0, 130.0]
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=123)
df.shape before outliers removal : (555, 57) Total number of outliers in data : 0 df.shape after outliers removal : (555, 57)
df = new_df
# Checking outliers status in the samples after the manual removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
TotalBsmtSF¶column="TotalBsmtSF"
plot_distribution_in_feature(df=df, feature=column)
# print(sorted(set(df[column].to_list())))
print(sorted(set(df[ df[column] > 1500 ][column].to_list())))
[1501, 1502, 1504, 1517, 1520, 1525, 1536, 1541, 1552, 1554, 1561, 1566, 1567, 1568, 1571, 1573, 1574, 1580, 1582, 1588, 1595, 1604, 1610, 1614, 1616, 1622, 1623, 1626, 1638, 1652, 1656, 1682, 1686, 1698, 1700, 1704, 1705, 1719, 1734, 2000]
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=1734+1)
df.shape before outliers removal : (554, 57) Total number of outliers in data : 0 df.shape after outliers removal : (554, 57)
df = new_df
# Checking outliers status in the samples after the manual removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
1stFlrSF¶column="1stFlrSF"
plot_distribution_in_feature(df=df, feature=column)
# print(sorted(set(df[column].to_list())))
print(sorted(set(df[ df[column] > 2000 ][column].to_list())))
[2117]
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=2000)
df.shape before outliers removal : (553, 57) Total number of outliers in data : 0 df.shape after outliers removal : (553, 57)
df = new_df
# Checking outliers status in the samples after the manual removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
GarageArea¶column="GarageArea"
plot_distribution_in_feature(df=df, feature=column)
# print(sorted(set(df[column].to_list())))
print(sorted(set(df[ df[column] > 850 ][column].to_list())))
[852, 857, 860, 865, 866, 868, 870, 871, 878, 884, 888]
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=870+1)
df.shape before outliers removal : (549, 57) Total number of outliers in data : 8 df.shape after outliers removal : (541, 57)
df = new_df
# Checking outliers status in the samples after the manual removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
SalePrice¶column="SalePrice"
plot_distribution_in_feature(df=df, feature=column)
# print(sorted(set(df[column].to_list())))
print(sorted(set(df[ df[column] > 310000][column].to_list())))
[313000, 315000, 316600, 317000, 320000, 324000, 325000, 325624, 328000, 328900, 335000, 337500, 339750]
new_df = remove_upper_bound_outliers_in_feature(df=df,feature=column, upper_bound=325000+1)
df.shape before outliers removal : (535, 57) Total number of outliers in data : 11 df.shape after outliers removal : (524, 57)
df = new_df
# Checking outliers status in the samples after the manual removal
plot_distribution_in_features(df=df,features_to_ignore=features_to_ignore_when_dealing_with_outliers)
Processing column : LotFrontage Processing column : LotArea Processing column : YearBuilt Processing column : YearRemodAdd Processing column : MasVnrArea Processing column : BsmtFinSF1 Processing column : BsmtUnfSF Processing column : TotalBsmtSF Processing column : 1stFlrSF Processing column : 2ndFlrSF Processing column : GrLivArea Processing column : BsmtFullBath Processing column : FullBath Processing column : HalfBath Processing column : BedroomAbvGr Processing column : TotRmsAbvGrd Processing column : Fireplaces Processing column : GarageCars Processing column : GarageArea Processing column : WoodDeckSF Processing column : OpenPorchSF Processing column : YrSold Processing column : SalePrice
Removing outliers in a feature tends to generate outliers in other feature. Since we make sure to remove nearly outliers in the output variable, we proceed with the next stage of the work.
# Recalling the data content
df.head().T
| 0 | 2 | 4 | 10 | 13 | |
|---|---|---|---|---|---|
| MSZoning | RL | RL | RL | RL | RL |
| LotFrontage | 65.0 | 68.0 | 84.0 | 70.0 | 91.0 |
| LotArea | 8450 | 11250 | 14260 | 11200 | 10652 |
| LotShape | Reg | IR1 | IR1 | Reg | IR1 |
| LandContour | Lvl | Lvl | Lvl | Lvl | Lvl |
| LotConfig | Inside | Inside | FR2 | Inside | Inside |
| LandSlope | Gtl | Gtl | Gtl | Gtl | Gtl |
| Neighborhood | CollgCr | CollgCr | NoRidge | Sawyer | CollgCr |
| Condition1 | Norm | Norm | Norm | Norm | Norm |
| Condition2 | Norm | Norm | Norm | Norm | Norm |
| BldgType | 1Fam | 1Fam | 1Fam | 1Fam | 1Fam |
| HouseStyle | 2Story | 2Story | 2Story | 1Story | 1Story |
| YearBuilt | 2003 | 2001 | 2000 | 1965 | 2006 |
| YearRemodAdd | 2003 | 2002 | 2000 | 1965 | 2007 |
| RoofStyle | Gable | Gable | Gable | Hip | Gable |
| Exterior1st | VinylSd | VinylSd | VinylSd | HdBoard | VinylSd |
| Exterior2nd | VinylSd | VinylSd | VinylSd | HdBoard | VinylSd |
| MasVnrArea | 196.0 | 162.0 | 350.0 | 0.0 | 306.0 |
| ExterQual | Gd | Gd | Gd | TA | Gd |
| ExterCond | TA | TA | TA | TA | TA |
| Foundation | PConc | PConc | PConc | CBlock | PConc |
| BsmtQual | Gd | Gd | Gd | TA | Gd |
| BsmtCond | TA | TA | TA | TA | TA |
| BsmtExposure | No | Mn | Av | No | Av |
| BsmtFinType1 | GLQ | GLQ | GLQ | Rec | Unf |
| BsmtFinSF1 | 706 | 486 | 655 | 906 | 0 |
| BsmtUnfSF | 150 | 434 | 490 | 134 | 1494 |
| TotalBsmtSF | 856 | 920 | 1145 | 1040 | 1494 |
| Heating | GasA | GasA | GasA | GasA | GasA |
| HeatingQC | Ex | Ex | Ex | Ex | Ex |
| CentralAir | Y | Y | Y | Y | Y |
| Electrical | SBrkr | SBrkr | SBrkr | SBrkr | SBrkr |
| 1stFlrSF | 856 | 920 | 1145 | 1040 | 1494 |
| 2ndFlrSF | 854 | 866 | 1053 | 0 | 0 |
| GrLivArea | 1710 | 1786 | 2198 | 1040 | 1494 |
| BsmtFullBath | 1 | 1 | 1 | 1 | 0 |
| FullBath | 2 | 2 | 2 | 1 | 2 |
| HalfBath | 1 | 1 | 1 | 0 | 0 |
| BedroomAbvGr | 3 | 3 | 4 | 3 | 3 |
| KitchenQual | Gd | Gd | Gd | TA | Gd |
| TotRmsAbvGrd | 8 | 6 | 9 | 5 | 7 |
| Functional | Typ | Typ | Typ | Typ | Typ |
| Fireplaces | 0 | 1 | 1 | 0 | 1 |
| FireplaceQu | Unknow_FireplaceQu | TA | TA | Unknow_FireplaceQu | Gd |
| GarageType | Attchd | Attchd | Attchd | Detchd | Attchd |
| GarageFinish | RFn | RFn | RFn | Unf | RFn |
| GarageCars | 2 | 2 | 3 | 1 | 3 |
| GarageArea | 548 | 608 | 836 | 384 | 840 |
| GarageQual | TA | TA | TA | TA | TA |
| GarageCond | TA | TA | TA | TA | TA |
| PavedDrive | Y | Y | Y | Y | Y |
| WoodDeckSF | 0 | 0 | 192 | 0 | 160 |
| OpenPorchSF | 61 | 42 | 84 | 0 | 33 |
| YrSold | 2008 | 2008 | 2008 | 2008 | 2007 |
| SaleType | WD | WD | WD | WD | New |
| SaleCondition | Normal | Normal | Normal | Normal | Partial |
| SalePrice | 208500 | 223500 | 250000 | 129500 | 279500 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| LotFrontage | 524.0 | 55.059160 | 29.845849 | 0.0 | 43.00 | 65.0 | 75.00 | 122.0 |
| LotArea | 524.0 | 8795.305344 | 2466.310858 | 3010.0 | 7418.75 | 8831.0 | 10355.25 | 14803.0 |
| YearBuilt | 524.0 | 1981.062977 | 25.700545 | 1908.0 | 1962.00 | 1992.5 | 2004.00 | 2009.0 |
| YearRemodAdd | 524.0 | 1988.541985 | 19.842404 | 1950.0 | 1971.75 | 1999.0 | 2005.00 | 2010.0 |
| MasVnrArea | 524.0 | 66.561069 | 97.191292 | 0.0 | 0.00 | 0.0 | 125.00 | 360.0 |
| BsmtFinSF1 | 524.0 | 418.009542 | 385.942810 | 0.0 | 0.00 | 422.5 | 700.50 | 1567.0 |
| BsmtUnfSF | 524.0 | 606.826336 | 411.106145 | 0.0 | 280.00 | 513.5 | 867.25 | 1632.0 |
| TotalBsmtSF | 524.0 | 1024.835878 | 277.659013 | 290.0 | 839.75 | 970.0 | 1222.25 | 1734.0 |
| 1stFlrSF | 524.0 | 1081.125954 | 271.405765 | 438.0 | 864.00 | 1045.0 | 1274.50 | 1734.0 |
| 2ndFlrSF | 524.0 | 304.818702 | 403.555256 | 0.0 | 0.00 | 0.0 | 725.50 | 1349.0 |
| GrLivArea | 524.0 | 1385.944656 | 372.690716 | 438.0 | 1096.50 | 1375.5 | 1632.75 | 2520.0 |
| BsmtFullBath | 524.0 | 0.406489 | 0.491647 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| FullBath | 524.0 | 1.570611 | 0.510665 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
| HalfBath | 524.0 | 0.383588 | 0.486724 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| BedroomAbvGr | 524.0 | 2.767176 | 0.662568 | 1.0 | 2.00 | 3.0 | 3.00 | 4.0 |
| TotRmsAbvGrd | 524.0 | 6.196565 | 1.276912 | 3.0 | 5.00 | 6.0 | 7.00 | 10.0 |
| Fireplaces | 524.0 | 0.492366 | 0.558216 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| GarageCars | 524.0 | 1.803435 | 0.526095 | 1.0 | 1.00 | 2.0 | 2.00 | 3.0 |
| GarageArea | 524.0 | 470.612595 | 143.825834 | 160.0 | 384.00 | 474.0 | 572.00 | 840.0 |
| WoodDeckSF | 524.0 | 87.253817 | 99.113617 | 0.0 | 0.00 | 56.5 | 156.50 | 431.0 |
| OpenPorchSF | 524.0 | 39.555344 | 44.300844 | 0.0 | 0.00 | 30.0 | 63.00 | 174.0 |
| YrSold | 524.0 | 2007.774809 | 1.337378 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
| SalePrice | 524.0 | 170569.209924 | 47420.931618 | 35311.0 | 134975.00 | 167620.0 | 200230.75 | 306000.0 |
We will first explore the distribution of our features.
df.columns.to_list()
['MSZoning', 'LotFrontage', 'LotArea', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice']
MSZoning¶This feature identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
column="MSZoning"
df[ [column] ].describe()
| MSZoning | |
|---|---|
| count | 524 |
| unique | 5 |
| top | RL |
| freq | 437 |
houses_count_per_zone = df[ [column] ].value_counts()
houses_count_per_zone
MSZoning RL 437 RM 52 FV 30 RH 4 C (all) 1 dtype: int64
houses_count_per_zone.plot(kind="bar")
plt.xlabel("Zoning classification")
plt.ylabel("Houses count")
plt.title("Houses distribution per zone");
Residential Low Density zone. Our furture model will learn well on this zone type. Residential Medium Density and Floating Village Residential are very little in number. As for houses in Residential High Density and Commercial zones, they are nearly non-existant. We need to collect more samples with those values in order to allow our model to better learn them.# df.columns.to_list()
LotFrontage¶This feature indicates the linear feet of street connected to the property.
column="LotFrontage"
df[ [column] ].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 524 entries, 0 to 1455 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LotFrontage 524 non-null float64 dtypes: float64(1) memory usage: 8.2 KB
plt.hist(data=df, x=column)
plt.xlabel("Street connected to property (ft)")
plt.ylabel("Count")
plt.title("Distribution of homes based on linear feet of street connected to the property");
The majority of properties we have have 60 to 70 linear feet of street connected to them.